Introduction

The main goal of this analysis is to apply association rule mining, one of the most common algorithms for studying what people purchase.

The dataset describes the items people purchase when they visit a shop.

The data is taken from the Kaggle platform: https://www.kaggle.com/gorkhachatryan01/purchase-behaviour

To check reproducibility, the same methods are then applied to a second dataset: https://www.kaggle.com/roshansharma/market-basket-optimization/version/1

Load the data

library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
library(arulesViz)
setwd("C:/Users/wangz/Desktop")
md = read.transactions("dataset.csv",format = "basket",
                                sep = ",",skip = 0, header = TRUE)
dim(md)
## [1] 1498   38
# average number of items per transaction
ave_size = mean(size(md))
ave_size
## [1] 10.34913
summary(md)
## transactions as itemMatrix in sparse format with
##  1498 rows (elements/itemsets/transactions) and
##  38 columns (items) and a density of 0.2723456 
## 
## most frequent items:
## vegetables    poultry    waffles     bagels lunch meat    (Other) 
##        894        431        418        417        413      12930 
## 
## element (itemset/transaction) length distribution:
## sizes
##   3   4   5   6   7   8   9  10  11  12  13  14 
##   8  57  51  51  71  74  95 191 304 320 212  64 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    9.00   11.00   10.35   12.00   14.00 
## 
## includes extended item information - examples:
##          labels
## 1  all- purpose
## 2 aluminum foil
## 3        bagels

Check which products appear most and least often, and visualize the frequencies.

# relative frequency
round(itemFrequency(md, type="relative"),4)
##                 all- purpose                aluminum foil 
##                       0.2630                       0.2637 
##                       bagels                         beef 
##                       0.2784                       0.2623 
##                       butter                      cereals 
##                       0.2610                       0.2737 
##                      cheeses                   coffee/tea 
##                       0.2603                       0.2630 
##                 dinner rolls dishwashing liquid/detergent 
##                       0.2583                       0.2684 
##                         eggs                        flour 
##                       0.2690                       0.2570 
##                       fruits                    hand soap 
##                       0.2637                       0.2377 
##                    ice cream             individual meals 
##                       0.2750                       0.2717 
##                        juice                      ketchup 
##                       0.2577                       0.2503 
##            laundry detergent                   lunch meat 
##                       0.2644                       0.2757 
##                         milk                        mixes 
##                       0.2710                       0.2737 
##                 paper towels                        pasta 
##                       0.2550                       0.2717 
##                         pork                      poultry 
##                       0.2497                       0.2877 
##                sandwich bags              sandwich loaves 
##                       0.2497                       0.2490 
##                      shampoo                         soap 
##                       0.2477                       0.2657 
##                         soda              spaghetti sauce 
##                       0.2737                       0.2543 
##                        sugar                 toilet paper 
##                       0.2670                       0.2704 
##                    tortillas                   vegetables 
##                       0.2443                       0.5968 
##                      waffles                       yogurt 
##                       0.2790                       0.2684
# plot for relative frequency
itemFrequencyPlot(
  md,
  topN = 10,
  type = "relative",
  main = "Item frequency",
  cex.names = 0.85
)

#absolute frequency
itemFrequency(md, type="absolute")
##                 all- purpose                aluminum foil 
##                          394                          395 
##                       bagels                         beef 
##                          417                          393 
##                       butter                      cereals 
##                          391                          410 
##                      cheeses                   coffee/tea 
##                          390                          394 
##                 dinner rolls dishwashing liquid/detergent 
##                          387                          402 
##                         eggs                        flour 
##                          403                          385 
##                       fruits                    hand soap 
##                          395                          356 
##                    ice cream             individual meals 
##                          412                          407 
##                        juice                      ketchup 
##                          386                          375 
##            laundry detergent                   lunch meat 
##                          396                          413 
##                         milk                        mixes 
##                          406                          410 
##                 paper towels                        pasta 
##                          382                          407 
##                         pork                      poultry 
##                          374                          431 
##                sandwich bags              sandwich loaves 
##                          374                          373 
##                      shampoo                         soap 
##                          371                          398 
##                         soda              spaghetti sauce 
##                          410                          381 
##                        sugar                 toilet paper 
##                          400                          405 
##                    tortillas                   vegetables 
##                          366                          894 
##                      waffles                       yogurt 
##                          418                          402
#plot for absolute frequency
itemFrequencyPlot(
  md,
  topN = 10,
  type = "absolute",
  main = "Item frequency",
  cex.names = 0.85
)

The figure above shows the 10 most frequently purchased items. Vegetables rank first, followed by poultry and waffles.

#Plot for min support
itemFrequencyPlot(md, support = 0.1) #minimum support at 10%

Association rules

Global rules calculations

I use the Apriori algorithm. To simplify the analysis, I set confidence = 0.4 and support = 0.1. With these thresholds, the algorithm found 38 rules.

rules = apriori(md, parameter = list(supp = 0.1, conf = 0.4))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.4    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 149 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[38 item(s), 1498 transaction(s)] done [0.00s].
## sorting and recoding items ... [38 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [38 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
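Note that the "absolute minimum support count: 149" reported above is simply the relative support threshold scaled by the number of transactions and rounded down:

```r
n_transactions <- 1498  # number of transactions in md
min_support    <- 0.1   # relative support passed to apriori()
floor(min_support * n_transactions)  # 149
```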

Support

Support measures how often a given set of items appears in the whole dataset.

rules_supp = sort(rules, by = "support", decreasing = TRUE)
rules_supp_table = inspect(head(rules_supp), linebreak = FALSE)
##     lhs                    rhs          support   confidence coverage  lift    
## [1] {}                  => {vegetables} 0.5967957 0.5967957  1.0000000 1.000000
## [2] {yogurt}            => {vegetables} 0.1762350 0.6567164  0.2683578 1.100404
## [3] {poultry}           => {vegetables} 0.1748999 0.6078886  0.2877170 1.018587
## [4] {laundry detergent} => {vegetables} 0.1728972 0.6540404  0.2643525 1.095920
## [5] {lunch meat}        => {vegetables} 0.1715621 0.6222760  0.2757009 1.042695
## [6] {cereals}           => {vegetables} 0.1702270 0.6219512  0.2736983 1.042151
##     count
## [1] 894  
## [2] 264  
## [3] 262  
## [4] 259  
## [5] 257  
## [6] 255
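To make the definition concrete, here is a minimal sketch on a few hypothetical toy baskets (not the Kaggle data), computing support in base R:

```r
# Hypothetical toy baskets, for illustration only
baskets <- list(
  c("milk", "bread"),
  c("milk", "eggs"),
  c("bread", "eggs", "milk"),
  c("eggs")
)

# support(X) = share of baskets that contain every item of X
support <- function(itemset, baskets) {
  mean(sapply(baskets, function(b) all(itemset %in% b)))
}

support("milk", baskets)             # 3 of 4 baskets -> 0.75
support(c("milk", "bread"), baskets) # 2 of 4 baskets -> 0.5
```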

Confidence

Confidence measures how likely a customer is to buy product Y (rhs) given that product(s) X (lhs) are already in their basket.

rules_conf = sort(rules, by = "confidence", decreasing = TRUE)
rules_conf_table = inspect(head(rules_conf), linebreak = FALSE)
##     lhs                    rhs          support   confidence coverage  lift    
## [1] {yogurt}            => {vegetables} 0.1762350 0.6567164  0.2683578 1.100404
## [2] {laundry detergent} => {vegetables} 0.1728972 0.6540404  0.2643525 1.095920
## [3] {eggs}              => {vegetables} 0.1695594 0.6302730  0.2690254 1.056095
## [4] {lunch meat}        => {vegetables} 0.1715621 0.6222760  0.2757009 1.042695
## [5] {cereals}           => {vegetables} 0.1702270 0.6219512  0.2736983 1.042151
## [6] {flour}             => {vegetables} 0.1595461 0.6207792  0.2570093 1.040187
##     count
## [1] 264  
## [2] 259  
## [3] 254  
## [4] 257  
## [5] 255  
## [6] 239
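The confidence values can be verified directly from the counts in the output. For the top rule {yogurt} => {vegetables}, 264 transactions contain both items, while yogurt appears in 402 transactions (see the absolute frequencies above):

```r
# counts taken from the outputs above
n_both   <- 264  # transactions containing both yogurt and vegetables
n_yogurt <- 402  # transactions containing yogurt
round(n_both / n_yogurt, 7)  # 0.6567164, matching the table
```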

Lift

Lift can be understood as a measure of correlation: it compares how often lhs and rhs occur together with how often they would co-occur if they were independent.

rules_lift = sort(rules, by = "lift", decreasing = TRUE)
rules_lift_table = inspect(head(rules_lift), linebreak = FALSE)
##     lhs                    rhs          support   confidence coverage  lift    
## [1] {yogurt}            => {vegetables} 0.1762350 0.6567164  0.2683578 1.100404
## [2] {laundry detergent} => {vegetables} 0.1728972 0.6540404  0.2643525 1.095920
## [3] {eggs}              => {vegetables} 0.1695594 0.6302730  0.2690254 1.056095
## [4] {lunch meat}        => {vegetables} 0.1715621 0.6222760  0.2757009 1.042695
## [5] {cereals}           => {vegetables} 0.1702270 0.6219512  0.2736983 1.042151
## [6] {flour}             => {vegetables} 0.1595461 0.6207792  0.2570093 1.040187
##     count
## [1] 264  
## [2] 259  
## [3] 254  
## [4] 257  
## [5] 255  
## [6] 239

Looking at the results, all lift values are higher than 1, so we can say that the rhs products are more likely to be bought together with the lhs products than they would be if the two were independent.
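The lift values follow the same way from the counts. For {yogurt} => {vegetables}, lift is the support of the full itemset divided by the product of the supports of the two items:

```r
n        <- 1498  # total transactions
n_both   <- 264   # yogurt and vegetables together
n_yogurt <- 402   # transactions containing yogurt
n_veg    <- 894   # transactions containing vegetables
lift <- (n_both / n) / ((n_yogurt / n) * (n_veg / n))
round(lift, 6)  # 1.100404, matching the table
```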

plot(rules, engine="plotly")

Change the rhs to another product: ice cream rules calculation

In our data, vegetables are by far the most frequent item, so nearly every rule points to them and little else can be observed. Let's therefore fix the rhs to another product: ice cream.

rules_ice_cream = apriori(
    data = md,
    parameter = list(supp = 0.01, conf = 0.4),
    appearance = list(default = "lhs", rhs = "ice cream"),
    control = list(verbose = F)
  )
rules_ice_cream_table = inspect(rules_ice_cream, linebreak = FALSE)
##      lhs                                                  rhs        
## [1]  {hand soap,spaghetti sauce,vegetables}            => {ice cream}
## [2]  {cereals,paper towels,sandwich loaves}            => {ice cream}
## [3]  {all- purpose,lunch meat,spaghetti sauce}         => {ice cream}
## [4]  {aluminum foil,pasta,spaghetti sauce}             => {ice cream}
## [5]  {dishwashing liquid/detergent,flour,paper towels} => {ice cream}
## [6]  {aluminum foil,paper towels,soda}                 => {ice cream}
## [7]  {aluminum foil,coffee/tea,soda}                   => {ice cream}
## [8]  {aluminum foil,juice,milk}                        => {ice cream}
## [9]  {aluminum foil,beef,yogurt}                       => {ice cream}
## [10] {aluminum foil,beef,vegetables}                   => {ice cream}
## [11] {aluminum foil,milk,toilet paper}                 => {ice cream}
##      support    confidence coverage   lift     count
## [1]  0.01001335 0.4054054  0.02469960 1.474023 15   
## [2]  0.01001335 0.4838710  0.02069426 1.759317 15   
## [3]  0.01001335 0.4054054  0.02469960 1.474023 15   
## [4]  0.01001335 0.5000000  0.02002670 1.817961 15   
## [5]  0.01001335 0.5000000  0.02002670 1.817961 15   
## [6]  0.01001335 0.4838710  0.02069426 1.759317 15   
## [7]  0.01134846 0.4594595  0.02469960 1.670559 17   
## [8]  0.01001335 0.5000000  0.02002670 1.817961 15   
## [9]  0.01001335 0.4545455  0.02202937 1.652692 15   
## [10] 0.01802403 0.4576271  0.03938585 1.663897 27   
## [11] 0.01134846 0.4358974  0.02603471 1.584889 17

Because transactions containing ice cream are much less frequent, I reduced the support threshold to 0.01.

Due to the small sample sizes, no clear pattern emerges from these rules.
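As a sanity check, the reported supports follow directly from the counts: a rule seen in 15 of the 1498 transactions has support

```r
round(15 / 1498, 8)  # 0.01001335, as in the output above
```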

plot(rules_ice_cream, engine="plotly")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
plot(rules_ice_cream, method="graph") 

In this paper, I mainly used the Apriori algorithm for mining association rules. Although the results are not very strong, I think association rules are an interesting method of data analysis.

Use the same methods to analyze another dataset

library(kableExtra)
library(arules)
library(arulesViz)
transactions = read.transactions(
  "Market_Basket_Optimisation.csv",
  format = "basket",
  sep = ",",
  skip = 0,
  header = TRUE
)
transactions
## transactions in sparse format with
##  7500 transactions (rows) and
##  119 items (columns)
itemFrequencyPlot(
  transactions,
  topN = 20,
  type = "absolute",
  main = "Item frequency",
  cex.names = 0.85
)

Association rules

Global rules calculations

rules = apriori(transactions, parameter = list(supp = 0.01, conf = 0.40))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.4    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 75 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7500 transaction(s)] done [0.00s].
## sorting and recoding items ... [75 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [17 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

Support

rules_supp = sort(rules, by = "support", decreasing = TRUE)
rules_supp_table = inspect(head(rules_supp), linebreak = FALSE)
##     lhs                            rhs             support    confidence
## [1] {ground beef}               => {mineral water} 0.04093333 0.4165536 
## [2] {olive oil}                 => {mineral water} 0.02746667 0.4178499 
## [3] {soup}                      => {mineral water} 0.02306667 0.4564644 
## [4] {ground beef,spaghetti}     => {mineral water} 0.01706667 0.4353741 
## [5] {ground beef,mineral water} => {spaghetti}     0.01706667 0.4169381 
## [6] {chocolate,spaghetti}       => {mineral water} 0.01586667 0.4047619 
##     coverage   lift     count
## [1] 0.09826667 1.748266 307  
## [2] 0.06573333 1.753707 206  
## [3] 0.05053333 1.915771 173  
## [4] 0.03920000 1.827256 128  
## [5] 0.04093333 2.394361 128  
## [6] 0.03920000 1.698777 119
rules_supp_table %>%
  kable() %>%
  kable_styling()

Confidence

rules_conf = sort(rules, by = "confidence", decreasing = TRUE)
rules_conf_table = inspect(head(rules_conf), linebreak = FALSE)
##     lhs                         rhs             support    confidence
## [1] {eggs,ground beef}       => {mineral water} 0.01013333 0.5066667 
## [2] {ground beef,milk}       => {mineral water} 0.01106667 0.5030303 
## [3] {chocolate,ground beef}  => {mineral water} 0.01093333 0.4739884 
## [4] {frozen vegetables,milk} => {mineral water} 0.01106667 0.4689266 
## [5] {soup}                   => {mineral water} 0.02306667 0.4564644 
## [6] {pancakes,spaghetti}     => {mineral water} 0.01146667 0.4550265 
##     coverage   lift     count
## [1] 0.02000000 2.126469  76  
## [2] 0.02200000 2.111207  83  
## [3] 0.02306667 1.989319  82  
## [4] 0.02360000 1.968075  83  
## [5] 0.05053333 1.915771 173  
## [6] 0.02520000 1.909736  86
rules_conf_table %>%
  kable() %>%
  kable_styling()

Lift

rules_lift = sort(rules, by = "lift", decreasing = TRUE)
rules_lift_table = inspect(head(rules_lift), linebreak = FALSE)
##     lhs                            rhs             support    confidence
## [1] {ground beef,mineral water} => {spaghetti}     0.01706667 0.4169381 
## [2] {eggs,ground beef}          => {mineral water} 0.01013333 0.5066667 
## [3] {ground beef,milk}          => {mineral water} 0.01106667 0.5030303 
## [4] {chocolate,ground beef}     => {mineral water} 0.01093333 0.4739884 
## [5] {frozen vegetables,milk}    => {mineral water} 0.01106667 0.4689266 
## [6] {soup}                      => {mineral water} 0.02306667 0.4564644 
##     coverage   lift     count
## [1] 0.04093333 2.394361 128  
## [2] 0.02000000 2.126469  76  
## [3] 0.02200000 2.111207  83  
## [4] 0.02306667 1.989319  82  
## [5] 0.02360000 1.968075  83  
## [6] 0.05053333 1.915771 173
rules_lift_table %>%
  kable() %>%
  kable_styling()
plot(rules, engine="plotly")

Chocolate rules calculation

rules_chocolate = apriori(
    data = transactions,
    parameter = list(supp = 0.001, conf = 0.7),
    appearance = list(default = "lhs", rhs = "chocolate"),
    control = list(verbose = F)
  )
rules_chocolate_table = inspect(rules_chocolate, linebreak = FALSE)
##     lhs                                                  rhs        
## [1] {red wine,tomato sauce}                           => {chocolate}
## [2] {almonds,olive oil,spaghetti}                     => {chocolate}
## [3] {almonds,milk,spaghetti}                          => {chocolate}
## [4] {escalope,french fries,shrimp}                    => {chocolate}
## [5] {burgers,olive oil,pancakes}                      => {chocolate}
## [6] {frozen vegetables,mineral water,pancakes,shrimp} => {chocolate}
##     support     confidence coverage    lift     count
## [1] 0.001066667 0.8000000  0.001333333 4.882018 8    
## [2] 0.001066667 0.7272727  0.001466667 4.438198 8    
## [3] 0.001066667 0.7272727  0.001466667 4.438198 8    
## [4] 0.001066667 0.8888889  0.001200000 5.424464 8    
## [5] 0.001200000 0.7500000  0.001600000 4.576892 9    
## [6] 0.001066667 0.7272727  0.001466667 4.438198 8
rules_chocolate_table %>%
  kable() %>%
  kable_styling()
plot(rules_chocolate, engine="plotly")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
plot(rules_chocolate, method="graph") 

Conclusions

From this project, we can see that association rules are an interesting method of data analysis that can relatively easily uncover many relationships. I also applied the same methods to a second dataset, which demonstrates the reproducibility of my code.